As mentioned previously, in SQL Server 2008 the full-text indexes are stored inside the database, and the full-text engine runs inside the SQL Server process. This redesign has resulted in many architectural changes in SQL Server 2008 Full-Text Search.
The two main components of Full-Text Search are as follows:
Indexing
The indexing engine connects to your database and extracts the content from the tables you are full-text indexing. It then sends this stream to COM components called filters (or IFilters), which run in an out-of-process service called the FT Daemon Host. Each filter understands a particular kind of content and can extract the text data from it. For example, if you store XML or Word documents in your database, the corresponding filters can parse that text or binary data and emit the words and tokens they find in it. The filter chosen is the default text filter if you are using the char, varchar, or text data types, or the XML filter if you are using the xml data type. If you are indexing varbinary documents, the indexing engine reads the document type column for each row and launches the filter that corresponds to the value stored there.
If you are storing Word documents in a varbinary data type column, and your full-text index creation statement specifies a document type column called DocumentType, the contents of this column for each such row should be doc, .doc, docx, or .docx.
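As a minimal sketch of the document type column just described, the following script creates a table that stores documents in a varbinary(max) column and full-text indexes it. The table dbo.Documents, its columns, the key index PK_Documents, and the catalog MyCatalog are hypothetical names chosen for illustration, and LANGUAGE 1033 (U.S. English) is an assumption about the content.

```sql
-- Hypothetical table: all object names here are illustrative only.
CREATE TABLE dbo.Documents
(
    DocumentId   int IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_Documents PRIMARY KEY,   -- unique, single-column key index
    DocumentType varchar(10)    NOT NULL,      -- file extension, for example .doc
    Content      varbinary(max) NOT NULL       -- the stored document itself
);
GO

CREATE FULLTEXT CATALOG MyCatalog;
GO

-- TYPE COLUMN tells the indexer which column names the filter to launch per row.
CREATE FULLTEXT INDEX ON dbo.Documents
(
    Content TYPE COLUMN DocumentType LANGUAGE 1033
)
KEY INDEX PK_Documents
ON MyCatalog
WITH CHANGE_TRACKING AUTO;
GO
```

The document type column must be a character type (char, nchar, varchar, or nvarchar); the length of 10 used here is an arbitrary choice.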
You can obtain a list of the document types for which filters are loaded by running the following query:
select document_type from sys.fulltext_document_types
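A slightly wider query can help when you want to see which component implements a given filter. This is only a sketch; it uses the other columns exposed by sys.fulltext_document_types, and the two extensions in the WHERE clause are examples rather than a required list.

```sql
-- Show the DLL path, version, and vendor behind selected filters.
SELECT document_type, path, version, manufacturer
FROM sys.fulltext_document_types
WHERE document_type IN ('.doc', '.xml')   -- example extensions only
ORDER BY document_type;
```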
Each filter understands the
file format of the type of document it indexes. For example, the Word
filter understands the file formats for Word documents and emits the
textual data it finds in the Word documents; the XML filter understands
the XML documents and emits the textual data it finds in them.
If you need to index documents for which the file type does not appear in the results of sys.fulltext_document_types, you need to install the appropriate filter on the server running SQL Server 2008 and then allow SQL Server 2008 to use it.
To allow SQL Server to use these third-party IFilters, you need to issue the following command:
exec sp_fulltext_service 'load_os_resources', 1
This command loads the filter if it is installed on the OS. In most cases, this is sufficient. In some cases, however, SQL Server's attempt to verify the signature/certificate embedded in the COM component/filter can cause problems in two ways. First, the filter may not have a certificate, or SQL Server may be unable to validate the certificate with the issuing authority. Second, the performance impact of having to validate the certificate/signature causes the initial queries to take a long time as the validation process proceeds. For these two reasons, you might want to disable the certificate/signature check by using the following command:
exec sp_fulltext_service 'verify_signature', 0
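To confirm that both settings took effect, you can read them back with FULLTEXTSERVICEPROPERTY and then check whether the new document type is listed. The following is a sketch; the .pdf extension is only an assumed example of a third-party filter you might have installed.

```sql
-- 1 = enabled, 0 = disabled for each service-level setting.
SELECT FULLTEXTSERVICEPROPERTY('LoadOSResources') AS load_os_resources,
       FULLTEXTSERVICEPROPERTY('VerifySignature') AS verify_signature;

-- Check whether the filter for an example extension is now visible.
SELECT document_type, path
FROM sys.fulltext_document_types
WHERE document_type = '.pdf';   -- assumed example extension
```

If the second query returns no rows, the filter is probably not registered with the OS, or the full-text service has not yet picked it up.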
Microsoft has published documentation on how to develop your own filters. For more information on how to do this, consult
http://msdn2.microsoft.com/en-us/library/ms916793.aspx
The filters then send the stream of textual data they emit to other components called word breakers. Word breakers are language specific; the one that is used depends on the language you specified for indexing the contents of your columns.
The neutral word
breaker basically breaks words at whitespace boundaries and at
punctuation (, . : ; ’ “ ! -) and indexes only alphanumeric characters.
The English (U.S.) and British (or International English) word breakers index hyphenated words both without the hyphens and as their component words, so data-base is indexed as data, base, and database. They also index capitalized acronyms both as single letters and as the whole word. For example, F.B.I. is indexed as f, b, i, and fbi (words are indexed in lowercase).
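You can observe word breaker behavior directly with the sys.dm_fts_parser dynamic management function, which runs a query string through the word breaker for a given LCID. The following sketch compares the neutral word breaker (LCID 0) with the U.S. English one (LCID 1033) on the two examples above. Passing NULL as the stop list is a deliberate choice here so that single letters are not filtered out; the exact rows returned can vary by version and installed word breakers.

```sql
-- Requires membership in the sysadmin fixed server role.
SELECT 'neutral' AS word_breaker, occurrence, display_term
FROM sys.dm_fts_parser(N'"data-base F.B.I."', 0, NULL, 0)
UNION ALL
SELECT 'English (U.S.)', occurrence, display_term
FROM sys.dm_fts_parser(N'"data-base F.B.I."', 1033, NULL, 0)
ORDER BY word_breaker, occurrence;
```

Changing 1033 to 2057 (British English) or 1031 (German) lets you repeat the comparison for other word breakers.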
The English and British English word breakers are nearly identical, with the exception that during the searching process, different stems may be used. For example, U.S. English speakers may say oriented, whereas British English speakers may say orientated (in Canada, oriented is now more common; in the rest of the English-speaking world outside the United States, orientated is more common).
The German and Dutch word breakers index compound words both as the compound and as their constituent words. For example, the German word Volkswagen is indexed as volkswagen, volks, and wagen.
For Far Eastern
languages, the word breakers break the sentence at whitespace and then
go through the “words” and extract characters. In some Far Eastern
languages, characters appear contiguous to each other in blocks that
appear to Westerners as words. In fact, each character is a word unto
itself, and characters can be combined to form new words. These
characters may be indexed singly or in multiple character combinations.
By default, the word breaker used by the indexing process is the one for the default full-text language specified in sp_configure, unless you specify that you want the contents of the columns you are full-text indexing to be indexed in a different language:
exec sp_configure 'show advanced options',1
reconfigure with override
exec sp_configure 'default full-text language'
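The value reported by sp_configure is an LCID. As a sketch of how to interpret it, the following queries list the languages the full-text engine supports and show which language each full-text indexed column in the current database actually uses.

```sql
-- Map LCIDs to language names.
SELECT lcid, name
FROM sys.fulltext_languages
ORDER BY name;

-- Show the language assigned to each full-text indexed column.
SELECT OBJECT_NAME(fic.object_id)             AS table_name,
       COL_NAME(fic.object_id, fic.column_id) AS column_name,
       fic.language_id,
       fl.name                                AS language_name
FROM sys.fulltext_index_columns AS fic
LEFT JOIN sys.fulltext_languages AS fl
       ON fl.lcid = fic.language_id
ORDER BY table_name, column_name;
```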
Some documents have
language-specific tags in them that launch different word breakers than
the ones you specify on your server or in your full-text index creation
statement. For example, Word and XML documents have language tags
embedded in them. If your Word documents are in German, and you specify
in your full-text index creation statement to use the French word
breakers, your Word documents are indexed in German, not French.
When the word breakers have done their work, the stop lists are applied and the stop words are removed. Then the remaining words are sent to the full-text indexes. The full-text indexes store positional information, so they know where a word occurs in a document. These word positions still account for the stop words that were removed.
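The effect of a stop list on word positions can be sketched with sys.dm_fts_parser. In the following example, the third argument (0) selects the system stop list, and the phrase is an arbitrary sample. Stop words should be flagged in the special_term column, while the occurrence numbers of the remaining words still reflect their original positions.

```sql
-- Requires membership in the sysadmin fixed server role.
SELECT occurrence, display_term, special_term
FROM sys.dm_fts_parser(N'"the history of the database"', 1033, 0, 0)
ORDER BY occurrence;
```

Preserving positions in this way is what allows phrase and proximity searches to keep working even though the stop words themselves are not indexed.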
At any one time, there may be multiple temporary, memory-resident full-text indexes. Periodically, these temporary full-text indexes are consolidated into a single master full-text index. This process is called a master merge. You can force a master merge by reorganizing a catalog (using the T-SQL statement ALTER FULLTEXT CATALOG MyCatalog REORGANIZE, where your catalog is named MyCatalog) or by optimizing it (an option available to you in the Catalog Properties dialog).
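A master merge can be sketched as follows. The script looks at the index fragments that have been written to the current database, forces the merge, and then checks whether the merge is still running. MyCatalog is the example catalog name used above; note that sys.fulltext_index_fragments shows only persisted fragments, not ones still held in memory.

```sql
-- Fragments currently stored for full-text indexes in this database.
SELECT OBJECT_NAME(table_id) AS table_name,
       fragment_id, row_count, data_size, status
FROM sys.fulltext_index_fragments
ORDER BY table_name, fragment_id;
GO

-- Force a master merge for every full-text index in the catalog.
ALTER FULLTEXT CATALOG MyCatalog REORGANIZE;
GO

-- 1 = a master merge is in progress, 0 = no merge is running.
SELECT FULLTEXTCATALOGPROPERTY('MyCatalog', 'MergeStatus') AS merge_in_progress;
```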
Searching
Although the indexer
launches word breakers and filters as out-of-process SQL Server
components, the search process is entirely within the SQL Server engine.
To query the full-text indexes, you need to use CONTAINS or FREETEXT predicates or their rowset analogs (CONTAINSTABLE, FREETEXTTABLE).
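The following sketch shows the general shape of these queries. It assumes a hypothetical table dbo.Documents with a key column DocumentId and a full-text indexed column Content; the search strings are arbitrary examples.

```sql
-- CONTAINS: precise matching with Boolean operators.
SELECT DocumentId
FROM dbo.Documents
WHERE CONTAINS(Content, N'"full-text" AND search');

-- FREETEXT: looser matching on the meaning of the search string.
SELECT DocumentId
FROM dbo.Documents
WHERE FREETEXT(Content, N'full-text search architecture');

-- CONTAINSTABLE: returns a rowset with [KEY] and [RANK] columns you can join to.
SELECT d.DocumentId, ft.[RANK]
FROM CONTAINSTABLE(dbo.Documents, Content, N'"full-text" NEAR search') AS ft
JOIN dbo.Documents AS d
  ON d.DocumentId = ft.[KEY]
ORDER BY ft.[RANK] DESC;
```

The rowset functions are the usual choice when you need to order results by relevance, because the predicates do not expose a rank.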
Just as the indexer applies the default server full-text language for indexing, the search process applies the default full-text language for searching.
Consider a search on the French word courir (to run). If you were to search in English on this word, the search would be conducted on courir and courirs. However, on a server with the default full-text language set to French, your search would be conducted on couraient, courais, courait, courant, coure, courent, coures, courez, couriez, courions, courir, courons, courra, courrai, courraient, courrais, courrait, courras, courrez, courriez, courrions, courrons, courront, cours, court, couru, courue, courues, courumes, coururent, courus, courusse, courussent, courusses, courussiez, courussions, courut, courutes.
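You can inspect these expansions yourself, and override the language for a single query, along the following lines. This is a sketch: LCIDs 1033 and 1036 are U.S. English and French, and dbo.Documents with its DocumentId and Content columns is a hypothetical full-text indexed table. The forms actually returned depend on the stemmers installed on your server.

```sql
-- Compare the inflectional forms generated by each language's stemmer.
-- Requires membership in the sysadmin fixed server role.
SELECT 'English' AS stemmer, display_term
FROM sys.dm_fts_parser(N'FORMSOF(INFLECTIONAL, courir)', 1033, 0, 0)
UNION ALL
SELECT 'French', display_term
FROM sys.dm_fts_parser(N'FORMSOF(INFLECTIONAL, courir)', 1036, 0, 0);

-- Ask for French stemming on one query, regardless of the server default.
SELECT DocumentId
FROM dbo.Documents
WHERE CONTAINS(Content, N'FORMSOF(INFLECTIONAL, courir)', LANGUAGE 1036);
```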
Now that you understand the architecture of Full-Text Search, let’s discuss how to create full-text catalogs.
Note
The 2005 version of the AdventureWorks database can be
installed using the same installer that installs the AdventureWorks2008
or AdventureWorks2008R2 database. If you didn’t install AdventureWorks
when you installed either of these sample databases, simply relaunch the
installer and choose to install the AdventureWorks OLTP database.